Stochastic Gradient Descent

Motivation

Although batch methods are straightforward to get working provided a good off-the-shelf implementation (because they have very few hyper-parameters to tune), they require computing the cost and gradient over the entire training set, which becomes prohibitively expensive for large datasets. Stochastic gradient descent (SGD) instead computes the gradient using only a single example, or a small minibatch of examples, at a time, which scales to large datasets and also allows the model to be updated as new data arrives online.

Tips

Use a minibatch rather than a single example

Computing the gradient over a minibatch rather than a single example reduces the variance of each parameter update and allows the computation to use efficient vectorized matrix operations. A typical minibatch size is 256, although the optimal size can vary for different applications and architectures.
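
As a concrete illustration, the sketch below performs a single minibatch update on a toy least-squares objective; the toy data, the grad function, and the hyper-parameter values are assumptions made for this example rather than part of the tutorial.

```python
import numpy as np

def grad(theta, X_batch, y_batch):
    """Gradient of the mean squared error on one minibatch (illustrative objective)."""
    residual = X_batch @ theta - y_batch
    return X_batch.T @ residual / len(y_batch)

# Toy data and hyper-parameters (arbitrary example values).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta = np.zeros(10)
alpha, batch_size = 0.1, 256   # learning rate, and the typical minibatch size from the text

# One SGD update computed on a randomly drawn minibatch.
idx = rng.choice(len(y), size=batch_size, replace=False)
theta -= alpha * grad(theta, X[idx], y[idx])
```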

Choose a proper learning rate

A common strategy is to use a small constant learning rate that gives stable convergence on an initial subset of the data, and to anneal the rate over the course of training, for example by halving it after each epoch or by using a schedule of the form a/(b + t), where t is the iteration number. There are many more advanced and sophisticated methods beyond these.
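
For illustration, the sketch below implements the two simple annealing schedules just mentioned; the constants passed in are arbitrary example values.

```python
def halving_schedule(alpha0, epoch):
    """Halve the initial learning rate alpha0 after each epoch."""
    return alpha0 * 0.5 ** epoch

def inverse_schedule(a, b, t):
    """Learning rate of the form a / (b + t), where t is the iteration number."""
    return a / (b + t)

print(halving_schedule(0.1, 3))        # 0.0125
print(inverse_schedule(1.0, 10.0, 5))  # ~0.0667
```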

Shuffle the data prior to each epoch
Presenting the examples in a random order each epoch avoids bias in the gradient estimates and leads to better convergence.
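
Putting these tips together, the following sketch shows per-epoch shuffling inside a minibatch SGD loop on the same kind of toy least-squares problem; the data, gradient, and hyper-parameters are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta = np.zeros(10)
alpha, batch_size, num_epochs = 0.1, 256, 5

for epoch in range(num_epochs):
    perm = rng.permutation(len(y))              # fresh random order every epoch
    for start in range(0, len(y), batch_size):
        batch = perm[start:start + batch_size]
        residual = X[batch] @ theta - y[batch]
        theta -= alpha * X[batch].T @ residual / len(batch)
```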

Momentum

The objectives of deep architectures often have, near local optima, the form of a long shallow ravine leading to the optimum with steep walls on the sides. In this setting the negative gradient used by standard SGD points down one of the steep sides rather than along the ravine towards the optimum, which can lead to very slow convergence.
Momentum is one method for pushing the objective more quickly along the shallow ravine. The momentum update is given by

v = γv + α∇_θ J(θ; x^(i), y^(i))
θ = θ − v

In the above equations, v is the current velocity vector, which has the same dimension as the parameter vector θ. The learning rate α is as described above, although when using momentum α may need to be smaller since the magnitude of the gradient will be larger. Finally, the momentum parameter γ ∈ (0, 1] determines for how many iterations the previous gradients are incorporated into the current update. Generally γ is set to 0.5 until the initial learning stabilizes and is then increased to 0.9 or higher.
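
A minimal sketch of this momentum update on the toy least-squares objective used above is given below; the step that raises γ from 0.5 to 0.9 follows the text, while the data, gradient, and remaining constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta = np.zeros(10)
v = np.zeros_like(theta)                 # velocity vector, same dimension as theta
alpha, batch_size, num_epochs = 0.05, 256, 10

for epoch in range(num_epochs):
    gamma = 0.5 if epoch < 2 else 0.9    # raise momentum once initial learning stabilizes
    perm = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = perm[start:start + batch_size]
        g = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
        v = gamma * v + alpha * g        # v = γv + α∇J
        theta -= v                       # θ = θ − v
```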

Reference

http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/